First, load the required packages for this walkthrough.
pkg_list <- c("tidyverse", "psych", "ez", "rcompanion", "knitr", "car",
              "afex", "ggfortify", "Hmisc", "emmeans", "jtools", "apaTables")
# Wrap the lapply() call in invisible() to suppress the lengthy list of
# attached packages it would otherwise print for every package.
invisible(lapply(pkg_list, library, quietly = TRUE, character.only = TRUE))
Optional package
# devtools::install_github("ropensci/skimr")
library("skimr")
The data set used in this walkthrough is automatically loaded when psych is loaded. You can examine what each of the columns represents by looking at the data set’s help page with ?sat.act.
Next, just to get a snapshot of the data, we can use head() and str(). Notice that gender and education are integer columns despite being categorical variables.
head(sat.act)
## gender education age ACT SATV SATQ
## 29442 2 3 19 24 500 500
## 29457 2 3 23 35 600 500
## 29498 2 3 20 21 480 470
## 29503 1 4 27 26 550 520
## 29504 1 2 33 31 600 550
## 29518 1 5 26 28 640 640
str(sat.act)
## 'data.frame': 700 obs. of 6 variables:
## $ gender : int 2 2 2 1 1 1 2 1 2 2 ...
## $ education: int 3 3 3 4 2 5 5 3 4 5 ...
## $ age : int 19 23 20 27 33 26 30 19 23 40 ...
## $ ACT : int 24 35 21 26 31 28 36 22 22 35 ...
## $ SATV : int 500 600 480 550 600 640 610 520 400 730 ...
## $ SATQ : int 500 500 470 520 550 640 500 560 600 800 ...
Since some of the analyses we will be running require categorical variables, we first need to preprocess some of the numeric columns into categories. Let’s quickly add two columns to the data set holding factored versions of gender and education.
sat_act <- sat.act %>%
  mutate(gender_fac = case_when(gender == 1 ~ "male",
                                gender == 2 ~ "female"),
         education_fac = case_when(education == 0 ~ "none",
                                   education == 1 ~ "some_hs",
                                   education == 2 ~ "high_school",
                                   education == 3 ~ "some_college",
                                   education == 4 ~ "college",
                                   education == 5 ~ "graduate"))
Before analysis, we can obtain descriptive statistics quickly by utilizing the describe() function. Note that both psych and Hmisc export a describe() function; because Hmisc was loaded after psych, its version masks psych’s, so the output below comes from Hmisc::describe().
describe(sat_act)
## sat_act
##
## 8 Variables 700 Observations
## ------------------------------------------------------------------------------------------------------------------------------------------------------
## gender
## n missing distinct Info Mean Gmd
## 700 0 2 0.685 1.647 0.4574
##
## Value 1 2
## Frequency 247 453
## Proportion 0.353 0.647
## ------------------------------------------------------------------------------------------------------------------------------------------------------
## education
## n missing distinct Info Mean Gmd
## 700 0 6 0.922 3.164 1.532
##
## Value 0 1 2 3 4 5
## Frequency 57 45 44 275 138 141
## Proportion 0.081 0.064 0.063 0.393 0.197 0.201
## ------------------------------------------------------------------------------------------------------------------------------------------------------
## age
## n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
## 700 0 48 0.994 25.59 9.56 17 18 19 22 29 39 46
##
## lowest : 13 14 15 16 17, highest: 57 58 61 62 65
## ------------------------------------------------------------------------------------------------------------------------------------------------------
## ACT
## n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
## 700 0 23 0.995 28.55 5.407 20 22 25 29 32 35 35
##
## lowest : 3 15 16 17 18, highest: 32 33 34 35 36
## ------------------------------------------------------------------------------------------------------------------------------------------------------
## SATV
## n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
## 700 0 70 0.997 612.2 126 400 450 550 620 700 750 780
##
## lowest : 200 240 245 300 330, highest: 785 790 797 799 800
## ------------------------------------------------------------------------------------------------------------------------------------------------------
## SATQ
## n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
## 687 13 72 0.998 610.2 129.9 400 450 530 620 700 750 777
##
## lowest : 200 230 250 300 320, highest: 780 790 795 799 800
## ------------------------------------------------------------------------------------------------------------------------------------------------------
## gender_fac
## n missing distinct
## 700 0 2
##
## Value female male
## Frequency 453 247
## Proportion 0.647 0.353
## ------------------------------------------------------------------------------------------------------------------------------------------------------
## education_fac
## n missing distinct
## 700 0 6
##
## Value college graduate high_school none some_college some_hs
## Frequency 138 141 44 57 275 45
## Proportion 0.197 0.201 0.063 0.081 0.393 0.064
## ------------------------------------------------------------------------------------------------------------------------------------------------------
psych also provides descriptive statistics split by a grouping variable via the partner function describeBy(). If we want descriptive statistics by gender_fac, we simply pass that column name to the group argument.
describeBy(sat_act, group = c("gender_fac"))
##
## Descriptive statistics by group
## group: female
## vars n mean sd median trimmed mad min max range skew kurtosis se
## gender 1 453 2.00 0.00 2 2.0 0.00 2 2 0 NaN NaN 0.00
## education 2 453 3.26 1.35 3 3.4 1.48 0 5 5 -0.74 0.27 0.06
## age 3 453 25.45 9.37 22 23.7 5.93 13 65 52 1.77 3.03 0.44
## ACT 4 453 28.42 4.69 29 28.6 4.45 15 36 21 -0.39 -0.42 0.22
## SATV 5 453 610.66 112.31 620 617.9 103.78 200 800 600 -0.65 0.42 5.28
## SATQ 6 442 596.00 113.07 600 602.2 133.43 200 800 600 -0.58 0.13 5.38
## gender_fac* 7 453 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
## education_fac* 8 453 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
## ----------------------------------------------------------------------------------------------------------------
## group: male
## vars n mean sd median trimmed mad min max range skew kurtosis se
## gender 1 247 1.0 0.00 1 1.00 0.00 1 1 0 NaN NaN 0.00
## education 2 247 3.0 1.54 3 3.12 1.48 0 5 5 -0.54 -0.60 0.10
## age 3 247 25.9 9.74 22 24.23 5.93 14 58 44 1.43 1.43 0.62
## ACT 4 247 28.8 5.06 30 29.23 4.45 3 36 33 -1.06 1.89 0.32
## SATV 5 247 615.1 114.16 630 622.07 118.61 200 800 600 -0.63 0.13 7.26
## SATQ 6 245 635.9 116.02 660 645.53 94.89 300 800 500 -0.72 -0.12 7.41
## gender_fac* 7 247 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
## education_fac* 8 247 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
We can add multiple grouping variables in the same manner.
describeBy(sat_act, group = c("gender_fac", "education_fac"))
##
## Descriptive statistics by group
## gender_fac: female
## education_fac: college
## vars n mean sd median trimmed mad min max range skew kurtosis se
## gender 1 87 2.0 0.00 2 2.0 0.00 2 2 0 NaN NaN 0.00
## education 2 87 4.0 0.00 4 4.0 0.00 4 4 0 NaN NaN 0.00
## age 3 87 29.1 7.76 26 27.8 5.93 21 52 31 1.26 0.70 0.83
## ACT 4 87 29.4 4.32 30 29.6 4.45 19 36 17 -0.27 -0.67 0.46
## SATV 5 87 615.0 106.62 620 621.4 88.96 300 800 500 -0.58 0.28 11.43
## SATQ 6 86 597.6 106.24 600 605.8 118.61 300 800 500 -0.71 0.20 11.46
## gender_fac* 7 87 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
## education_fac* 8 87 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
## ----------------------------------------------------------------------------------------------------------------
## gender_fac: male
## education_fac: college
## vars n mean sd median trimmed mad min max range skew kurtosis se
## gender 1 51 1.0 0.00 1 1.0 0.00 1 1 0 NaN NaN 0.00
## education 2 51 4.0 0.00 4 4.0 0.00 4 4 0 NaN NaN 0.00
## age 3 51 32.2 9.03 29 30.8 8.90 23 57 34 1.20 0.63 1.27
## ACT 4 51 28.9 4.42 29 29.3 4.45 16 36 20 -0.74 0.12 0.62
## SATV 5 51 620.3 81.72 620 623.3 88.96 430 800 370 -0.26 -0.29 11.44
## SATQ 6 51 635.9 104.12 640 642.5 88.96 400 800 400 -0.46 -0.45 14.58
## gender_fac* 7 51 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
## education_fac* 8 51 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
## ----------------------------------------------------------------------------------------------------------------
## gender_fac: female
## education_fac: graduate
## vars n mean sd median trimmed mad min max range skew kurtosis se
## gender 1 95 2.0 0.00 2 2.0 0.00 2 2 0 NaN NaN 0.00
## education 2 95 5.0 0.00 5 5.0 0.00 5 5 0 NaN NaN 0.00
## age 3 95 34.3 10.67 30 32.7 8.90 22 65 43 1.18 0.61 1.09
## ACT 4 95 29.0 4.19 29 29.1 4.45 18 36 18 -0.31 -0.73 0.43
## SATV 5 95 620.4 95.72 620 623.6 74.13 300 800 500 -0.46 0.43 9.82
## SATQ 6 93 606.7 105.55 600 608.9 148.26 350 800 450 -0.14 -0.94 10.95
## gender_fac* 7 95 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
## education_fac* 8 95 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
## ----------------------------------------------------------------------------------------------------------------
## gender_fac: male
## education_fac: graduate
## vars n mean sd median trimmed mad min max range skew kurtosis se
## gender 1 46 1.0 0.00 1.0 1.0 0.00 1 1 0 NaN NaN 0.00
## education 2 46 5.0 0.00 5.0 5.0 0.00 5 5 0 NaN NaN 0.00
## age 3 46 35.9 10.00 35.5 35.1 11.12 22 58 36 0.47 -0.67 1.48
## ACT 4 46 30.8 3.11 32.0 30.9 2.97 25 36 11 -0.38 -0.81 0.46
## SATV 5 46 623.5 99.58 645.0 631.2 96.37 390 770 380 -0.61 -0.43 14.68
## SATQ 6 46 657.8 89.61 680.0 661.7 103.78 475 800 325 -0.45 -0.77 13.21
## gender_fac* 7 46 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
## education_fac* 8 46 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
## ----------------------------------------------------------------------------------------------------------------
## gender_fac: female
## education_fac: high_school
## vars n mean sd median trimmed mad min max range skew kurtosis se
## gender 1 21 2.0 0.00 2 2.0 0.00 2 2 0 NaN NaN 0.00
## education 2 21 2.0 0.00 2 2.0 0.00 2 2 0 NaN NaN 0.00
## age 3 21 30.1 12.22 26 28.4 10.38 18 57 39 1.16 0.01 2.67
## ACT 4 21 27.3 5.23 28 27.5 4.45 15 36 21 -0.32 -0.34 1.14
## SATV 5 21 593.6 115.34 600 598.2 118.61 375 770 395 -0.44 -0.91 25.17
## SATQ 6 20 586.5 120.96 585 587.8 163.09 375 800 425 0.01 -1.11 27.05
## gender_fac* 7 21 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
## education_fac* 8 21 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
## ----------------------------------------------------------------------------------------------------------------
## gender_fac: male
## education_fac: high_school
## vars n mean sd median trimmed mad min max range skew kurtosis se
## gender 1 23 1.0 0.00 1 1.0 0.00 1 1 0 NaN NaN 0.00
## education 2 23 2.0 0.00 2 2.0 0.00 2 2 0 NaN NaN 0.00
## age 3 23 25.3 8.68 22 23.6 4.45 18 55 37 1.94 3.63 1.81
## ACT 4 23 26.6 6.39 28 27.7 4.45 3 32 29 -2.14 5.39 1.33
## SATV 5 23 560.0 152.29 600 570.5 148.26 200 800 600 -0.53 -0.59 31.75
## SATQ 6 23 569.1 160.65 600 575.8 177.91 300 800 500 -0.36 -1.44 33.50
## gender_fac* 7 23 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
## education_fac* 8 23 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
## ----------------------------------------------------------------------------------------------------------------
## gender_fac: female
## education_fac: none
## vars n mean sd median trimmed mad min max range skew kurtosis se
## gender 1 30 2.0 0.00 2 2.0 0.00 2 2 0 NaN NaN 0.00
## education 2 30 0.0 0.00 0 0.0 0.00 0 0 0 NaN NaN 0.00
## age 3 30 17.0 1.07 17 17.1 0.74 13 18 5 -1.75 4.13 0.19
## ACT 4 30 26.1 5.06 26 25.9 5.93 15 36 21 0.08 -0.56 0.92
## SATV 5 30 595.3 123.46 595 597.1 148.26 350 800 450 -0.09 -0.81 22.54
## SATQ 6 29 599.7 123.20 600 601.0 148.26 333 800 467 -0.09 -0.99 22.88
## gender_fac* 7 30 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
## education_fac* 8 30 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
## ----------------------------------------------------------------------------------------------------------------
## gender_fac: male
## education_fac: none
## vars n mean sd median trimmed mad min max range skew kurtosis se
## gender 1 27 1.0 0.00 1 1.0 0.00 1 1 0 NaN NaN 0.00
## education 2 27 0.0 0.00 0 0.0 0.00 0 0 0 NaN NaN 0.00
## age 3 27 16.9 1.04 17 17.0 1.48 14 18 4 -0.86 0.34 0.20
## ACT 4 27 29.0 5.00 29 29.2 5.93 20 36 16 -0.30 -1.13 0.96
## SATV 5 27 640.1 132.24 670 646.2 177.91 400 800 400 -0.29 -1.40 25.45
## SATQ 6 27 642.7 127.90 660 647.9 177.91 400 800 400 -0.24 -1.36 24.61
## gender_fac* 7 27 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
## education_fac* 8 27 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
## ----------------------------------------------------------------------------------------------------------------
## gender_fac: female
## education_fac: some_college
## vars n mean sd median trimmed mad min max range skew kurtosis se
## gender 1 195 2.0 0.00 2 2.0 0.00 2 2 0 NaN NaN 0.00
## education 2 195 3.0 0.00 3 3.0 0.00 3 3 0 NaN NaN 0.00
## age 3 195 21.1 4.75 20 20.0 1.48 17 46 29 3.41 12.83 0.34
## ACT 4 195 28.2 4.78 29 28.4 4.45 16 36 20 -0.46 -0.47 0.34
## SATV 5 195 610.0 119.78 620 619.6 118.61 200 800 600 -0.81 0.66 8.58
## SATQ 6 190 590.9 114.46 600 598.9 118.61 200 800 600 -0.72 0.38 8.30
## gender_fac* 7 195 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
## education_fac* 8 195 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
## ----------------------------------------------------------------------------------------------------------------
## gender_fac: male
## education_fac: some_college
## vars n mean sd median trimmed mad min max range skew kurtosis se
## gender 1 80 1.0 0.00 1 1.0 0.00 1 1 0 NaN NaN 0.00
## education 2 80 3.0 0.00 3 3.0 0.00 3 3 0 NaN NaN 0.00
## age 3 80 20.8 3.06 20 20.3 1.48 17 34 17 2.00 4.55 0.34
## ACT 4 80 28.6 5.03 30 28.8 5.19 17 36 19 -0.45 -0.92 0.56
## SATV 5 80 617.4 111.79 630 624.5 111.19 300 800 500 -0.62 -0.06 12.50
## SATQ 6 79 642.6 118.28 680 653.1 118.61 300 800 500 -0.81 -0.17 13.31
## gender_fac* 7 80 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
## education_fac* 8 80 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
## ----------------------------------------------------------------------------------------------------------------
## gender_fac: female
## education_fac: some_hs
## vars n mean sd median trimmed mad min max range skew kurtosis se
## gender 1 25 2.0 0.00 2 2.0 0.00 2 2 0 NaN NaN 0.00
## education 2 25 1.0 0.00 1 1.0 0.00 1 1 0 NaN NaN 0.00
## age 3 25 19.3 4.62 18 18.1 0.00 17 37 20 2.86 7.27 0.92
## ACT 4 25 28.1 5.13 27 28.3 4.45 18 36 18 -0.21 -0.78 1.03
## SATV 5 25 597.0 119.38 610 600.8 133.43 350 799 449 -0.31 -0.95 23.88
## SATQ 6 24 592.5 140.83 625 606.6 111.19 230 799 569 -0.93 0.20 28.75
## gender_fac* 7 25 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
## education_fac* 8 25 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
## ----------------------------------------------------------------------------------------------------------------
## gender_fac: male
## education_fac: some_hs
## vars n mean sd median trimmed mad min max range skew kurtosis se
## gender 1 20 1.0 0.00 1 1.0 0.00 1 1 0 NaN NaN 0.00
## education 2 20 1.0 0.00 1 1.0 0.00 1 1 0 NaN NaN 0.00
## age 3 20 19.6 6.12 18 18.2 0.00 17 45 28 3.55 11.78 1.37
## ACT 4 20 26.7 7.11 28 27.1 8.15 15 35 20 -0.30 -1.51 1.59
## SATV 5 20 603.0 141.24 600 611.2 185.32 300 780 480 -0.39 -1.12 31.58
## SATQ 6 19 625.8 95.87 650 630.9 88.96 400 765 365 -0.66 -0.47 21.99
## gender_fac* 7 20 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
## education_fac* 8 20 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
Optional
While describe() is useful for quick glimpses at the data, it does not always play nicely with the tidyverse. If you want to stick with a more “pure” tidyverse approach to descriptive statistics, you can use skimr.
sat_act %>%
skimr::skim(.)
## Skim summary statistics
## n obs: 700
## n variables: 8
##
## ── Variable type:character ─────────────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n min max empty n_unique
## education_fac 0 700 700 4 12 0 6
## gender_fac 0 700 700 4 6 0 2
##
## ── Variable type:integer ───────────────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## ACT 0 700 700 28.55 4.82 3 25 29 32 36 ▁▁▁▁▃▅▇▇
## age 0 700 700 25.59 9.5 13 19 22 29 65 ▇▇▂▂▁▁▁▁
## education 0 700 700 3.16 1.43 0 3 3 4 5 ▂▁▁▁▇▁▅▅
## gender 0 700 700 1.65 0.48 1 1 2 2 2 ▅▁▁▁▁▁▁▇
## SATQ 13 687 700 610.22 115.64 200 530 620 700 800 ▁▁▁▅▃▇▇▅
## SATV 0 700 700 612.23 112.9 200 550 620 700 800 ▁▁▁▃▃▇▅▃
While skim() doesn’t provide as much descriptive detail as describe(), it does provide the basics, along with a useful inline visual summary (a small histogram) of each variable’s distribution.
Similarly, for obtaining descriptives by grouping variables you can utilize the group_by() function.
sat_act %>%
group_by(., gender_fac) %>%
skimr::skim(.)
## Skim summary statistics
## n obs: 700
## n variables: 8
## group variables: gender_fac
##
## ── Variable type:character ─────────────────────────────────────────────────────────────────────────────────────────────────────
## gender_fac variable missing complete n min max empty n_unique
## female education_fac 0 453 453 4 12 0 6
## male education_fac 0 247 247 4 12 0 6
##
## ── Variable type:integer ───────────────────────────────────────────────────────────────────────────────────────────────────────
## gender_fac variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## female ACT 0 453 453 28.42 4.69 15 25 29 32 36 ▁▁▂▅▇▅▇▆
## female age 0 453 453 25.45 9.37 13 19 22 28 65 ▆▇▃▂▁▁▁▁
## female education 0 453 453 3.26 1.35 0 3 3 4 5 ▁▁▁▁▇▁▃▃
## female gender 0 453 453 2 0 2 2 2 2 2 ▁▁▁▇▁▁▁▁
## female SATQ 11 442 453 596 113.07 200 500 600 683 800 ▁▁▁▅▃▇▆▃
## female SATV 0 453 453 610.66 112.31 200 550 620 700 800 ▁▁▁▃▃▇▅▃
## male ACT 0 247 247 28.79 5.06 3 25 30 32.5 36 ▁▁▁▁▂▃▆▇
## male age 0 247 247 25.86 9.74 14 19 22 30 58 ▇▇▃▂▁▁▁▁
## male education 0 247 247 3 1.54 0 2 3 4 5 ▃▂▁▂▇▁▅▅
## male gender 0 247 247 1 0 1 1 1 1 1 ▁▁▁▇▁▁▁▁
## male SATQ 2 245 247 635.87 116.02 300 570 660 720 800 ▁▂▂▃▅▅▇▆
## male SATV 0 247 247 615.11 114.16 200 540 630 700 800 ▁▁▂▅▃▇▇▅
sat_act %>%
group_by(., education_fac) %>%
skimr::skim(.)
## Skim summary statistics
## n obs: 700
## n variables: 8
## group variables: education_fac
##
## ── Variable type:character ─────────────────────────────────────────────────────────────────────────────────────────────────────
## education_fac variable missing complete n min max empty n_unique
## college gender_fac 0 138 138 4 6 0 2
## graduate gender_fac 0 141 141 4 6 0 2
## high_school gender_fac 0 44 44 4 6 0 2
## none gender_fac 0 57 57 4 6 0 2
## some_college gender_fac 0 275 275 4 6 0 2
## some_hs gender_fac 0 45 45 4 6 0 2
##
## ── Variable type:integer ───────────────────────────────────────────────────────────────────────────────────────────────────────
## education_fac variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## college ACT 0 138 138 29.26 4.35 16 27 30 32 36 ▁▂▂▃▆▇▆▆
## college age 0 138 138 30.24 8.36 21 24 28 34.75 57 ▇▅▂▂▁▁▁▁
## college education 0 138 138 4 0 4 4 4 4 4 ▁▁▁▇▁▁▁▁
## college gender 0 138 138 1.63 0.48 1 1 2 2 2 ▅▁▁▁▁▁▁▇
## college SATQ 1 137 138 611.85 106.71 300 565 610 690 800 ▁▂▁▃▇▅▇▃
## college SATV 0 138 138 616.95 97.88 300 572.5 620 680 800 ▁▁▁▃▇▇▅▃
## graduate ACT 0 141 141 29.6 3.95 18 27 30 32 36 ▁▁▃▆▆▅▇▅
## graduate age 0 141 141 34.83 10.44 22 27 33 40 65 ▇▅▅▃▁▁▂▁
## graduate education 0 141 141 5 0 5 5 5 5 5 ▁▁▁▇▁▁▁▁
## graduate gender 0 141 141 1.67 0.47 1 1 2 2 2 ▃▁▁▁▁▁▁▇
## graduate SATQ 2 139 141 623.63 103.09 350 535 640 700 800 ▁▂▆▅▆▆▇▆
## graduate SATV 0 141 141 621.4 96.65 300 570 630 700 800 ▁▁▂▅▇▇▆▃
## high_school ACT 0 44 44 26.98 5.81 3 25 28 31 36 ▁▁▁▁▃▇▇▅
## high_school age 0 44 44 27.57 10.68 18 20 25 28.75 57 ▇▆▁▂▁▁▁▂
## high_school education 0 44 44 2 0 2 2 2 2 2 ▁▁▁▇▁▁▁▁
## high_school gender 0 44 44 1.48 0.51 1 1 1 2 2 ▇▁▁▁▁▁▁▇
## high_school SATQ 1 43 44 577.21 142.18 300 450 600 700 800 ▂▃▃▃▆▂▇▃
## high_school SATV 0 44 44 576.02 135.43 200 487.5 600 682.5 800 ▁▁▃▃▂▇▇▂
## none ACT 0 57 57 27.47 5.21 15 24 28 31 36 ▁▂▅▃▇▃▃▅
## none age 0 57 57 16.95 1.04 13 17 17 18 18 ▁▁▁▁▃▁▇▆
## none education 0 57 57 0 0 0 0 0 0 0 ▁▁▁▇▁▁▁▁
## none gender 0 57 57 1.53 0.5 1 1 2 2 2 ▇▁▁▁▁▁▁▇
## none SATQ 1 56 57 620.43 126.21 333 515 620 712.5 800 ▁▁▇▅▅▃▅▇
## none SATV 0 57 57 616.51 128.53 350 500 600 710 800 ▂▂▆▃▅▃▆▇
## some_college ACT 0 275 275 28.29 4.85 16 25 29 32 36 ▁▃▃▅▅▇▆▆
## some_college age 0 275 275 21.01 4.32 17 19 20 21 46 ▇▃▁▁▁▁▁▁
## some_college education 0 275 275 3 0 3 3 3 3 3 ▁▁▁▇▁▁▁▁
## some_college gender 0 275 275 1.71 0.46 1 1 2 2 2 ▃▁▁▁▁▁▁▇
## some_college SATQ 6 269 275 606.07 117.76 200 525 620 700 800 ▁▁▁▅▃▇▇▅
## some_college SATV 0 275 275 612.13 117.36 200 540 625 700 800 ▁▁▁▃▃▇▆▅
## some_hs ACT 0 45 45 27.49 6.06 15 24 27 33 36 ▂▅▁▅▇▂▅▇
## some_hs age 0 45 45 19.47 5.27 17 18 18 18 45 ▇▁▁▁▁▁▁▁
## some_hs education 0 45 45 1 0 1 1 1 1 1 ▁▁▁▇▁▁▁▁
## some_hs gender 0 45 45 1.56 0.5 1 1 2 2 2 ▆▁▁▁▁▁▁▇
## some_hs SATQ 2 43 45 607.26 122.8 230 545 650 700 799 ▁▁▂▂▂▇▅▂
## some_hs SATV 0 45 45 599.67 128.05 300 500 600 700 799 ▂▂▃▅▇▃▇▇
Suppose we wanted to see whether the number of males and females differed by education level. One way to test this is to perform a chi-squared test of independence. In R, the chisq.test() function expects a contingency table in order to produce the appropriate statistic. A contingency table provides counts for each combination of groups.
edu_gender_table <- table(sat_act$education_fac, sat_act$gender_fac)
print(edu_gender_table)
##
## female male
## college 87 51
## graduate 95 46
## high_school 21 23
## none 30 27
## some_college 195 80
## some_hs 25 20
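Row proportions often make such a table easier to scan than raw counts. As a small sketch (using two of the rows printed above), prop.table() converts counts to proportions along a chosen margin:

```r
# Counts taken from the first two rows of the table above
counts <- matrix(c(87, 51,
                   95, 46),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("college", "graduate"),
                                 c("female", "male")))
# margin = 1 divides each cell by its row total, so each row sums to 1
prop.table(counts, margin = 1)
```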
Next, we simply pass the contingency table to the chisq.test() function.
chi_test <- chisq.test(edu_gender_table)
chi_test
##
## Pearson's Chi-squared test
##
## data: edu_gender_table
## X-squared = 16, df = 5, p-value = 0.007
It appears as though the two variables are not independent. We will examine the pairwise comparisons in a moment to get a better idea of which education levels are driving this result.
We can get the observed, expected, and residuals from the saved object.
chi_test$observed
##
## female male
## college 87 51
## graduate 95 46
## high_school 21 23
## none 30 27
## some_college 195 80
## some_hs 25 20
chi_test$expected
##
## female male
## college 89.3 48.7
## graduate 91.2 49.8
## high_school 28.5 15.5
## none 36.9 20.1
## some_college 178.0 97.0
## some_hs 29.1 15.9
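The expected counts above follow directly from the table margins: each cell’s expectation is (row total × column total) / grand total, and summing the squared, scaled deviations reproduces the chi-squared statistic. A base-R sketch using the counts printed earlier:

```r
# Rebuild the contingency table from the counts printed above
obs <- matrix(c(87, 51,
                95, 46,
                21, 23,
                30, 27,
               195, 80,
                25, 20),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("college", "graduate", "high_school",
                                "none", "some_college", "some_hs"),
                              c("female", "male")))

# Expected count for each cell: (row total * column total) / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
round(expected, 1)

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chisq_stat <- sum((obs - expected)^2 / expected)
chisq_stat
```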
We could also get the chi-squared statistic from the table’s summary() method.
summary(edu_gender_table)
## Number of cases in table: 700
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 16, df = 5, p-value = 0.007
Aside from simply printing a table, we can visualize frequencies with a mosaic plot (using the vcd package, which must be installed but need not be attached since it is called via ::).
vcd::mosaic(~ education_fac + gender_fac, data = sat_act, shade = TRUE, legend = TRUE)
If we want to compare each level against the others in our contingency table, we can compute pairwise comparisons using rcompanion.
rcompanion::pairwiseNominalIndependence(edu_gender_table,
fisher = FALSE,
gtest = FALSE,
correct = "holm")
## Comparison p.Chisq p.adj.Chisq
## 1 college : graduate 0.52600 0.6610
## 2 college : high_school 0.10400 0.2600
## 3 college : none 0.23400 0.3900
## 4 college : some_college 0.13200 0.2830
## 5 college : some_hs 0.47200 0.6610
## 6 graduate : high_school 0.02970 0.1480
## 7 graduate : none 0.07440 0.2230
## 8 graduate : some_college 0.52900 0.6610
## 9 graduate : some_hs 0.20600 0.3860
## 10 high_school : none 0.77300 0.8280
## 11 high_school : some_college 0.00398 0.0597
## 12 high_school : some_hs 0.59800 0.6900
## 13 none : some_college 0.01140 0.0855
## 14 none : some_hs 0.92500 0.9250
## 15 some_college : some_hs 0.05920 0.2220
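The p.adj.Chisq column applies Holm’s step-down correction. Holm’s method can be reproduced with base R’s p.adjust(): sort the p-values, multiply the smallest by the number of tests m, the next by m − 1, and so on, then enforce monotonicity. A small sketch with hypothetical p-values:

```r
p <- c(0.004, 0.03, 0.2)  # hypothetical unadjusted p-values
m <- length(p)
# Step-down multipliers m, m-1, ..., 1 applied to the sorted p-values;
# cummax() enforces that adjusted values never decrease, pmin() caps at 1,
# and order(order(p)) restores the original ordering.
holm_manual <- pmin(1, cummax(sort(p) * (m - seq_along(p) + 1)))[order(order(p))]
holm_manual
all.equal(holm_manual, p.adjust(p, method = "holm"))  # the two agree
```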
Or with pairwise.t.test(). Note that this method requires the original, numeric data.
pairwise <- pairwise.t.test(sat_act$gender, sat_act$education, p.adjust.method = "holm")
# broom::tidy(pairwise)
pairwise
##
## Pairwise comparisons using t tests with pooled SD
##
## data: sat_act$gender and sat_act$education
##
## 0 1 2 3 4
## 1 1.00 - - - -
## 2 1.00 1.00 - - -
## 3 0.12 0.53 0.04 - -
## 4 1.00 1.00 0.63 1.00 -
## 5 0.53 1.00 0.22 1.00 1.00
##
## P value adjustment method: holm
Moving beyond categorical variables, we next test the relationship between numeric values using simple correlation.
The Pearson product-moment correlation measures the linear association between two variables, \(x\) and \(y\), on a standardized scale ranging from \(r = -1\) to \(r = 1\).
The correlation of \(x\) and \(y\) is a covariance that has been standardized by the standard deviations of \(x\) and \(y\), yielding a scale-insensitive measure of their linear association. For much more conceptual detail, see: https://psu-psychology.github.io/psy-597-SEM/01_correlation_regression/01_Correlation_and_Regression.html.
\[r_{XY}= \frac{s_{XY}}{s_{X} s_{Y}}\]
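In code, that standardization is simply the covariance divided by the product of the two standard deviations. A quick check with toy vectors (made up for illustration):

```r
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)
# Correlation = covariance standardized by both standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))
all.equal(r_manual, cor(x, y))  # the two agree
```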
Correlation is done using the cor() function.
Suppose we want to see whether ACT scores increase with age.
# Covariance
# cov(sat_act$age, sat_act$ACT)
# Correlation
cor(sat_act$age, sat_act$ACT)
## [1] 0.111
A small correlation, but cor() provides no test of significance. To obtain significance values, pass your data to cor.test(). This can be done by passing x and y directly or by using the formula method (which we will use again for linear models).
# Default method
# cor.test(sat_act$age, sat_act$ACT)
# Formula method
cor.test(~ age + ACT, data = sat_act)
## 
##  Pearson's product-moment correlation
## 
## data:  age and ACT
## t = 2.94, df = 698, p-value = 0.003
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.0367 0.1831
## sample estimates:
## cor
## 0.111
To visualize this relationship, we can pass the raw data to ggplot and get a simple regression line using stat_smooth() with the lm method.
ggplot(sat_act, aes(x=age, y=ACT)) +
geom_jitter(width=0.1) +
stat_smooth(method="lm", se=FALSE)
You can also pass a data frame of values to the cor() function to get a simple correlation matrix (a la SPSS).
cor(sat_act[c("age","ACT","SATV","SATQ")], use = "pairwise.complete.obs")
##           age    ACT    SATV    SATQ
## age    1.0000 0.1105 -0.0424 -0.0339
## ACT    0.1105 1.0000  0.5611  0.5871
## SATV  -0.0424 0.5611  1.0000  0.6443
## SATQ  -0.0339 0.5871  0.6443  1.0000
Optionally, for easier-on-the-eyes output, we can use a number of specialized functions.
Hmisc::rcorr(sat_act[c("age","ACT","SATV","SATQ")] %>% as.matrix(.))
## age ACT SATV SATQ
## age 1.00 0.11 -0.04 -0.03
## ACT 0.11 1.00 0.56 0.59
## SATV -0.04 0.56 1.00 0.64
## SATQ -0.03 0.59 0.64 1.00
##
## n
## age ACT SATV SATQ
## age 700 700 700 687
## ACT 700 700 700 687
## SATV 700 700 700 687
## SATQ 687 687 687 687
##
## P
## age ACT SATV SATQ
## age 0.0034 0.2631 0.3744
## ACT 0.0034 0.0000 0.0000
## SATV 0.2631 0.0000 0.0000
## SATQ 0.3744 0.0000 0.0000
apaTables::apa.cor.table(sat_act[c("age","ACT","SATV","SATQ")])
##
##
## Means, standard deviations, and correlations with confidence intervals
##
##
## Variable M SD 1 2 3
## 1. age 25.59 9.50
##
## 2. ACT 28.55 4.82 .11**
## [.04, .18]
##
## 3. SATV 612.23 112.90 -.04 .56**
## [-.12, .03] [.51, .61]
##
## 4. SATQ 610.22 115.64 -.03 .59** .64**
## [-.11, .04] [.54, .63] [.60, .69]
##
##
## Note. M and SD are used to represent mean and standard deviation, respectively.
## Values in square brackets indicate the 95% confidence interval.
## The confidence interval is a plausible range of population correlations
## that could have caused the sample correlation (Cumming, 2014).
## * indicates p < .05. ** indicates p < .01.
##
The overall goal is to give you an introduction to conducting regression analyses or linear modeling in R.
Let’s turn to ‘simple’ linear regression (one predictor, one outcome), then scale up to multiple regression (many predictors, one outcome). The standard linear regression model is implemented by the lm() function in R. lm() uses ordinary least squares (OLS), which estimates the parameters by minimizing the sum of squared residuals.
In simple regression, we are interested in a relationship of the form:
\[ Y = B_0 + B_1 X \]
where \(Y\) is the dependent variable (criterion) and \(X\) is the predictor (covariate). The intercept is represented by \(B_0\) and the slope for the \(X\) predictor by \(B_1\).
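Under OLS, the estimates have closed-form solutions tied to quantities we have already met: \(B_1 = r_{XY} \, s_Y / s_X\) and \(B_0 = \bar{Y} - B_1 \bar{X}\). A small sketch with made-up data confirms this against lm():

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)  # toy data for illustration
b1 <- cor(x, y) * sd(y) / sd(x)  # slope from the correlation and SDs
b0 <- mean(y) - b1 * mean(x)     # intercept from the means
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))  # matches lm()'s estimates
```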
Let’s take a look at the simple case of verbal SAT scores (SATV) as a function of quantitative SAT scores (SATQ).
ggplot(sat.act, aes(x=SATQ, y=SATV)) +
geom_point(color='darkblue', size = 3) +
geom_smooth(method=lm, se=FALSE, fullrange=TRUE, color='black', size=1.2) +
labs(x="Math Score", y="Verbal Score")
When conducting regression, we typically try to capture linear relationships among variables. We can introduce higher-order polynomial terms (e.g., quadratic models) or splines (more flexible shapes), but this is beyond the scope here.
Fortunately, this relationship looks quite linear! The higher the math score the higher the verbal score.
In R regression models, we use the ~ operator to denote ‘regressed on’. It’s not especially intuitive, but we say the criterion is regressed on the predictor. Here, if we think quantitative skill predicts verbal score, we’d say ‘SATV regressed on SATQ’ or ‘SATQ predicts SATV.’
In formula terms, this is SATV ~ SATQ, which we pass as the first argument to lm().
lm_SAT <- lm(SATV ~ SATQ, data=sat.act)
summary(lm_SAT)
##
## Call:
## lm(formula = SATV ~ SATQ, data = sat.act)
##
## Residuals:
## Min 1Q Median 3Q Max
## -365.9 -50.6 -3.2 54.7 257.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 227.1432 17.7798 12.8 <2e-16 ***
## SATQ 0.6312 0.0286 22.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 86.7 on 685 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.415, Adjusted R-squared: 0.414
## F-statistic: 486 on 1 and 685 DF, p-value: <2e-16
The output contains the individual parameter estimates of the model (here, just the intercept and slope), along with their standard errors and significance tests (t statistics and p-values). We also get global information such as the residual standard error and the coefficient of determination (\(R^2\)).
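As a quick aside, most of the numbers in that summary can be pulled out programmatically with standard accessor functions; a small sketch:

```r
# Standard accessors for a fitted lm object (values correspond to
# the pieces of the summary output).
library(psych)  # provides the sat.act dataset
fit <- lm(SATV ~ SATQ, data = sat.act)
coef(fit)               # named vector of parameter estimates
confint(fit)            # 95% confidence intervals for each coefficient
summary(fit)$r.squared  # coefficient of determination
sigma(fit)              # residual standard error
```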
We can also get useful diagnostic plots for free using the plot() function:
plot(lm_SAT)
The ggfortify package also provides an autoplot function that gives similar diagnostics within a handy ggplot-based graph.
autoplot(lm_SAT)
Using functionality from the car and boot packages, we can easily obtain estimates of the regression coefficients and their standard errors via nonparametric bootstrapping, which relaxes the normal-theory assumptions behind the usual standard errors and significance tests; in particular, the bootstrap does not assume normally distributed errors.
Nonparametric bootstrapping approximates the sampling distribution for a statistic of interest (e.g., a slope) by resampling the existing data with replacement many times and examining the resulting density.
sat.act.na <- na.omit(sat.act)
lm_SAT.na <- lm(SATV ~ SATQ, data=sat.act.na)
system.time(lm_SAT.boot <- Boot(lm_SAT.na, R=2000))
## user system elapsed
## 2.020 0.034 2.056
summary(lm_SAT.boot, high.moments=TRUE)
##
## Number of bootstrap replications R = 2000
## original bootBias bootSE bootMed bootSkew bootKurtosis
## (Intercept) 227.143 -4.17e-02 20.8410 227.266 0.0513 0.198
## SATQ 0.631 7.07e-05 0.0325 0.631 -0.0430 0.191
We can use the object to obtain 95% bootstrapped confidence intervals using the ‘bias corrected, accelerated’ method (aka bca).
confint(lm_SAT.boot, level=.95, type="bca")
## Bootstrap bca confidence intervals
##
## 2.5 % 97.5 %
## (Intercept) 187.015 269.539
## SATQ 0.565 0.696
And we can easily compare the bootstrapped and standard OLS models:
hist(lm_SAT.boot, legend="separate")
We can easily extend to larger regression models by adding terms to the right side of the formula. For example, in the sat.act dataset, we could examine the extent to which math scores (SATQ) are a function of verbal scores (SATV), age (age), and sex (gender, 1 = male, 2 = female).
asv_model <- lm(SATQ ~ age + gender + SATV, sat.act)
summary(asv_model)
##
## Call:
## lm(formula = SATQ ~ age + gender + SATV, data = sat.act)
##
## Residuals:
## Min 1Q Median 3Q Max
## -290.34 -50.32 5.81 51.51 295.57
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 273.5175 23.8614 11.46 < 2e-16 ***
## age -0.1267 0.3498 -0.36 0.72
## gender -36.8580 6.9202 -5.33 1.4e-07 ***
## SATV 0.6541 0.0293 22.33 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 86.8 on 683 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.438, Adjusted R-squared: 0.436
## F-statistic: 178 on 3 and 683 DF, p-value: <2e-16
sv_model <- lm(SATQ ~ gender + SATV, sat.act)
summary(sv_model)
##
## Call:
## lm(formula = SATQ ~ gender + SATV, data = sat.act)
##
## Residuals:
## Min 1Q Median 3Q Max
## -291.27 -50.46 5.64 51.89 295.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 269.8997 21.6571 12.46 < 2e-16 ***
## gender -36.8011 6.9140 -5.32 1.4e-07 ***
## SATV 0.6545 0.0293 22.37 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 86.8 on 684 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.438, Adjusted R-squared: 0.437
## F-statistic: 267 on 2 and 684 DF, p-value: <2e-16
It appears that gender and SATV are both influential predictors, whereas age is not (we drop it in the second model). We can examine the relationship graphically.
ggplot(sat.act, aes(x=SATV, y=SATQ, color=factor(gender))) +
geom_point() +
scale_color_discrete(name="Sex", breaks = c("1", "2"), labels = c("Male", "Female")) +
labs(title = "Multiple Regression", x = "SATV", y = "SATQ") +
stat_smooth(method=lm, se=FALSE)
Note that the broom package is very useful for extracting global and specific statistics from many models in R, including regression models. The introductory vignette provides a number of useful examples: https://cran.r-project.org/web/packages/broom/vignettes/broom.html. Here, what if we want to save the global statistics and parameter estimates into data.frame objects?
We can use the glance function to get the global model statistics.
broom::glance(asv_model)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 0.438 0.436 86.8 178. 3.52e-85 4 -4040. 8089. 8112. 5150990. 683
And the tidy function yields the parameter table:
broom::tidy(asv_model)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 274. 23.9 11.5 6.05e-28
## 2 age -0.127 0.350 -0.362 7.17e- 1
## 3 gender -36.9 6.92 -5.33 1.36e- 7
## 4 SATV 0.654 0.0293 22.3 2.51e-83
As you can imagine (and saw earlier in the functional programming overview), the ability to extract regression statistics into a tidy data.frame is a boon to scaling your analyses to multiple models and datasets.
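As a hedged sketch of that idea (using the built-in mtcars data rather than sat.act), we can nest a data frame by group, fit the same model to each subset, and stack the tidy coefficient tables:

```r
# Fit the same model within each cylinder group of mtcars and collect
# the tidy coefficient tables into one data.frame (2 coefficients x 3 groups).
library(dplyr)
library(tidyr)
library(purrr)
per_group <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(model = map(data, ~ lm(mpg ~ wt, data = .x)),
         coefs = map(model, broom::tidy)) %>%
  select(cyl, coefs) %>%
  unnest(coefs)
per_group  # one row per coefficient per cylinder group
```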
We can use the * operator in R to ask that both the constituent variables and their interaction(s) are entered into the model. For example:
int_model <- lm(SATQ ~ gender*age*SATV, sat.act)
summary(int_model)
##
## Call:
## lm(formula = SATQ ~ gender * age * SATV, data = sat.act)
##
## Residuals:
## Min 1Q Median 3Q Max
## -293.41 -50.48 4.05 52.30 291.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.40e+02 1.85e+02 -1.29 0.196
## gender 2.17e+02 1.08e+02 2.01 0.045 *
## age 1.84e+01 7.19e+00 2.56 0.011 *
## SATV 1.37e+00 2.96e-01 4.62 4.6e-06 ***
## gender:age -8.86e+00 4.15e+00 -2.14 0.033 *
## gender:SATV -3.38e-01 1.73e-01 -1.96 0.051 .
## age:SATV -2.55e-02 1.16e-02 -2.20 0.028 *
## gender:age:SATV 1.15e-02 6.68e-03 1.73 0.084 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 86.2 on 679 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.45, Adjusted R-squared: 0.444
## F-statistic: 79.4 on 7 and 679 DF, p-value: <2e-16
This model includes individual effects of sex (gender) and age (age), as well as their interaction (gender:age). This highlights that the asterisk operator * will compute all possible interactions among the specified predictors. For example, a*b*c*d will generate all effects up through and including the a x b x c x d interaction. By contrast, if you wish to specify a given interaction manually/directly, use the colon operator (e.g., a:b). The downside of the colon operator is that it does not guarantee that the corresponding lower-order effects are included, whereas * does, which is usually the sane default. As a reminder, you should essentially never include an interaction without its lower-order effects, because doing so can misassign the variance.
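You can inspect exactly which terms a formula will generate, without fitting anything, via terms(); this makes the * versus : distinction concrete:

```r
# The * operator expands to all main effects plus all interactions.
attr(terms(y ~ a*b*c), "term.labels")
# "a" "b" "c" "a:b" "a:c" "b:c" "a:b:c"

# Writing a*b out longhand with : gives the same terms.
attr(terms(y ~ a + b + a:b), "term.labels")
# "a" "b" "a:b"
```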
# plot the 2-way gender x age interaction with ggplot2
ggplot(data=sat.act,
aes(x=age, y=SATQ, colour = factor(gender))) +
geom_jitter()+
labs(x = "Age", y = "Math Score", colour = "Sex") +
theme_bw()+
theme(
plot.background = element_blank()
,panel.grid.major = element_blank()
,panel.grid.minor = element_blank()
,panel.border = element_blank()
) +
stat_smooth(method='lm', se=TRUE, fullrange=TRUE) +
scale_color_manual(labels = c("Male", "Female"), values = c("#d21959", "#4aded3")) +
theme(axis.text=element_text(size=12),
axis.title=element_text(size=14))
What do we see? Males tend to have higher math scores and maintain these scores with age. Females have lower math scores and show a decreasing trend as they get older.
(Some of the code and text here has been adapted from Russell Lenth’s excellent emmeans documentation: https://cran.r-project.org/web/packages/emmeans/)
One of the handiest packages in the R regression universe is emmeans, which can provide the ‘expected marginal means’ (em means), as well as a host of other contrasts and comparisons. In particular, it is very easy to test simple slopes and pairwise differences. Furthermore, the package works with multcomp to handle correction for multiple comparisons. See the longer documentation here.
Let’s bin age into three groups (under 25, 25–50, over 50) so that we have a categorical factor to work with, and then model math scores as a function of age group and verbal scores.
sat.act$agef[sat.act$age < 25] <- 1
sat.act$agef[sat.act$age >= 25 & sat.act$age <= 50] <- 2
sat.act$agef[sat.act$age > 50] <- 3
sat.act2 <- sat.act
sat.act2$agef <- as.factor(sat.act2$agef)
sat.act2$gender <- as.factor(sat.act2$gender)
sat.lm <- lm(SATQ ~ agef + SATV, data = sat.act2)
summary(sat.lm)
##
## Call:
## lm(formula = SATQ ~ agef + SATV, data = sat.act2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -301.62 -45.92 2.07 51.81 283.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 206.9031 18.8894 10.95 <2e-16 ***
## agef2 -0.0946 7.1350 -0.01 0.99
## agef3 15.5437 18.9726 0.82 0.41
## SATV 0.6579 0.0299 22.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 88.6 on 683 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.416, Adjusted R-squared: 0.413
## F-statistic: 162 on 3 and 683 DF, p-value: <2e-16
This output is hard to look at because there are many dummy codes and we have to infer the reference condition for each factor (usually the first level alphabetically). Also, we do not have an intuitive sense of the expected mean in each condition, because it depends on the sum of the intercept and the specific dummy codes for the condition of interest, averaging over the other predictors.
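To demystify those dummy codes, here is a tiny self-contained example showing the treatment coding R builds for a three-level factor (analogous to the agef2 and agef3 columns above):

```r
# Treatment (dummy) coding for a three-level factor: the reference
# level ("1") gets a row of zeros; each other level gets its own
# 0/1 column, matching coefficient names like agef2 and agef3.
f <- factor(c(1, 1, 2, 3, 2), levels = c(1, 2, 3))
contrasts(f)       # the coding scheme itself
model.matrix(~ f)  # the design-matrix columns lm() would use
```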
We can obtain the expected means for each condition.
sat.emm.s <- emmeans(sat.lm, "agef")
print(sat.emm.s)
## agef emmean SE df lower.CL upper.CL
## 1 610 4.32 683 601 618
## 2 610 5.67 683 598 621
## 3 625 18.47 683 589 662
##
## Confidence level used: 0.95
sat.emm.p <- emmeans(sat.lm, "SATV")
print(sat.emm.p)
## SATV emmean SE df lower.CL upper.CL
## 612 615 4.32 683 606 623
##
## Results are averaged over the levels of: agef
## Confidence level used: 0.95
print(emmeans(sat.lm, ~agef*SATV))
## agef SATV emmean SE df lower.CL upper.CL
## 1 612 610 4.32 683 601 618
## 2 612 610 5.67 683 598 621
## 3 612 625 18.47 683 589 662
##
## Confidence level used: 0.95
If we wanted to compare the pairwise differences in SATQ between age groups while controlling for SATV (and potentially other variables we add to the model), we could use the pairs function:
sat_pairs <- pairs(sat.emm.s)
print(sat_pairs)
## contrast estimate SE df t.ratio p.value
## 1 - 2 0.09 7.13 683 0.013 1.0000
## 1 - 3 -15.54 18.97 683 -0.819 0.6910
## 2 - 3 -15.64 19.32 683 -0.809 0.6970
##
## P value adjustment: tukey method for comparing a family of 3 estimates
Note that you can get a sense of the contrasts being tested by emmeans by examining the @linfct slot of the object. I’ve learned a lot by examining these contrast matrices and thinking about how to setup a (focal) contrast of interest. Also note that you get p-value adjustment for free (here, Tukey’s HSD method).
These are the linear functions for the predicted mean SATQ in each age group, with SATV held at its mean:
sat.emm.s@linfct
## (Intercept) agef2 agef3 SATV
## [1,] 1 0 0 612
## [2,] 1 1 0 612
## [3,] 1 0 1 612
What are the pairwise contrasts for the age groups?
sat_pairs@linfct
## (Intercept) agef2 agef3 SATV
## [1,] 0 -1 0 0
## [2,] 0 0 -1 0
## [3,] 0 1 -1 0
The emmeans package also provides useful plots to understand pairwise differences:
plot(sat.emm.s, comparisons = TRUE)
The blue bars are confidence intervals for the EMMs, and the red arrows are for the comparisons among them. If an arrow from one mean overlaps an arrow from another group, the difference is not significant, based on the adjust setting (which defaults to “tukey”). (Note: Don’t ever use confidence intervals for EMMs to perform comparisons; they can be very misleading.)
Consider a model in which we examine the association between SATQ and SATV across age. Here, we regress SATV on SATQ, age (three levels), and their interaction.
fit_sat <- lm(SATQ ~ SATV * agef, data = sat.act2)
summary(fit_sat)
##
## Call:
## lm(formula = SATQ ~ SATV * agef, data = sat.act2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -302.51 -50.21 2.09 54.61 256.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 178.8432 22.1853 8.06 3.4e-15 ***
## SATV 0.7034 0.0354 19.90 < 2e-16 ***
## agef2 109.8710 42.6641 2.58 0.0102 *
## agef3 17.2861 95.7898 0.18 0.8568
## SATV:agef2 -0.1804 0.0690 -2.61 0.0091 **
## SATV:agef3 -0.0022 0.1547 -0.01 0.9887
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 88.3 on 681 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.422, Adjusted R-squared: 0.417
## F-statistic: 99.3 on 5 and 681 DF, p-value: <2e-16
car::Anova(fit_sat, type="III") #overall effects of predictors in the model
## Anova Table (Type III tests)
##
## Response: SATQ
## Sum Sq Df F value Pr(>F)
## (Intercept) 506336 1 64.99 3.4e-15 ***
## SATV 3084321 1 395.85 < 2e-16 ***
## agef 51806 2 3.32 0.037 *
## SATV:agef 53928 2 3.46 0.032 *
## Residuals 5306050 681
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that this is a categorical (age group) x continuous (SATV) interaction. The output from car::Anova indicates that the interaction is significant, but we need more detailed guidance on how the slope for SATV is moderated by age group. We can visualize the interaction as follows:
ggplot(data=sat.act2,
aes(x=SATV, y=SATQ, colour = factor(agef))) +
geom_jitter()+
labs(x = "Verbal Score", y = "Math Score", colour = "Age") +
theme_bw()+
theme(
plot.background = element_blank()
,panel.grid.major = element_blank()
,panel.grid.minor = element_blank()
,panel.border = element_blank()
) +
stat_smooth(method='lm', se=FALSE, fullrange=TRUE) +
scale_color_manual(labels = c("Under 25", "25-50", "Over 50"), values = c("red", "green", "blue")) +
theme(axis.text=element_text(size=12),
axis.title=element_text(size=14))
In a simple slopes test, we might wish to know whether the slope for SATV is non-zero in each age group individually. Let’s start by getting the estimated marginal means for each age group.
emmeans(fit_sat, ~agef)
## NOTE: Results may be misleading due to involvement in interactions
## agef emmean SE df lower.CL upper.CL
## 1 610 4.31 681 601 618
## 2 609 5.66 681 598 620
## 3 626 18.43 681 589 662
##
## Confidence level used: 0.95
And pairwise differences between age groups:
pairs(emmeans(fit_sat, ~agef))
## NOTE: Results may be misleading due to involvement in interactions
## contrast estimate SE df t.ratio p.value
## 1 - 2 0.62 7.11 681 0.087 0.9960
## 1 - 3 -15.94 18.92 681 -0.842 0.6770
## 2 - 3 -16.56 19.28 681 -0.859 0.6660
##
## P value adjustment: tukey method for comparing a family of 3 estimates
Turning to SATV, because we are interested in its linear effect (slope), we use the emtrends function to estimate the slope in each age group individually. In terms of simple slopes, we test whether the SATV slope is non-zero in each age group. The infer argument in the summary of emtrends requests t-tests and p-values for the slopes.
summary(emtrends(model = fit_sat, ~agef, var="SATV"), infer=TRUE)
## agef SATV.trend SE df lower.CL upper.CL t.ratio p.value
## 1 0.703 0.0354 681 0.634 0.773 19.900 <.0001
## 2 0.523 0.0593 681 0.407 0.639 8.820 <.0001
## 3 0.701 0.1506 681 0.406 0.997 4.660 <.0001
##
## Confidence level used: 0.95
Finally, we could examine pairwise differences between slopes among age groups.
pairs(emtrends(model = fit_sat, ~agef, var="SATV"))
## contrast estimate SE df t.ratio p.value
## 1 - 2 0.1804 0.069 681 2.614 0.0250
## 1 - 3 0.0022 0.155 681 0.014 1.0000
## 2 - 3 -0.1782 0.162 681 -1.101 0.5140
##
## P value adjustment: tukey method for comparing a family of 3 estimates
In the sat.act dataset, we have treated age as a factor. If we keep this representation (as opposed to entering age as continuous), we can easily get orthogonal polynomial contrasts in emmeans. For example, is the effect of age group on SATQ linear, or might it be quadratic? (With three levels, only linear and quadratic contrasts are estimable.)
sat.emm.p <- emmeans(sat.lm, "agef")
ply <- contrast(sat.emm.p, "poly")
ply
## contrast estimate SE df t.ratio p.value
## linear 15.5 19.0 683 0.819 0.4130
## quadratic 15.7 22.1 683 0.712 0.4770
coef(ply) #show the contrast coefficients
## agef c.1 c.2
## 1 1 -1 1
## 2 2 0 -2
## 3 3 1 1
There is a lot more on probing interactions here: https://cran.r-project.org/web/packages/emmeans/vignettes/interactions.html.
Finally, we can examine effects in multivariate regression models (i.e., multiple DVs). Here, we examine whether both the verbal and math scores are associated with age, sex (gender), education, and ACT scores.
org.int <- lm(cbind(SATV, SATQ) ~ age + gender + education + ACT, data = sat.act2)
summary(org.int)
## Response SATV :
##
## Call:
## lm(formula = SATV ~ age + gender + education + ACT, data = sat.act2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -502.0 -47.9 7.6 60.1 239.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 261.2947 23.3033 11.21 <2e-16 ***
## age -1.3893 0.4537 -3.06 0.0023 **
## gender2 -0.0713 7.5002 -0.01 0.9924
## education 1.4926 3.0573 0.49 0.6256
## ACT 13.3790 0.7487 17.87 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 93.3 on 682 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.326, Adjusted R-squared: 0.322
## F-statistic: 82.3 on 4 and 682 DF, p-value: <2e-16
##
##
## Response SATQ :
##
## Call:
## lm(formula = SATQ ~ age + gender + education + ACT, data = sat.act2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -492.5 -48.0 8.2 63.3 257.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 259.647 22.891 11.34 < 2e-16 ***
## age -1.352 0.446 -3.03 0.0025 **
## gender2 -34.805 7.367 -4.72 2.8e-06 ***
## education 1.102 3.003 0.37 0.7138
## ACT 14.155 0.735 19.25 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 91.7 on 682 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.375, Adjusted R-squared: 0.372
## F-statistic: 102 on 4 and 682 DF, p-value: <2e-16
The overall goal is to give you a very quick introduction to conducting one-way ANOVA and two-way ANOVA in R.
Analysis of variance (ANOVA) provides a statistical test of whether two or more population means are equal. It is based on the law of total variance: the observed variance in a particular variable is partitioned into components attributable to different sources of variation. When the population means are equal, the differences among the group means reflect the operation of experimental error alone (within-groups deviation). When the population means are not equal, the differences among the group means reflect the operation of both experimental error (within-groups deviation) and treatment effects (between-groups deviation). In ANOVA, we form the ratio:
\[\frac{S^2_{treatment} + S^2_{error}}{S^2_{error}}\]
Variance is the sum of squared deviations from the mean divided by the degrees of freedom: \(s^2 = SS/df\). Sum of squares: \[SS_{Total} = SS_{between} + SS_{within}\] \[\sum_{i,j}(Y_{ij}-\bar{Y}_{T})^2 = \sum_{i} n_i(\bar{Y}_{i}-\bar{Y}_{T})^2 + \sum_{i,j}(Y_{ij}-\bar{Y}_{i})^2\] Degrees of freedom: \[df_{Total} = df_{between} + df_{within}\]
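We can verify this decomposition numerically. The sketch below computes the three sums of squares by hand on the ToothGrowth data used later in this section:

```r
# Verify SS_total = SS_between + SS_within on ToothGrowth (len by supp).
y <- ToothGrowth$len
g <- ToothGrowth$supp
grand <- mean(y)
group_means <- tapply(y, g, mean)   # mean of each group, in level order
ss_total   <- sum((y - grand)^2)
ss_between <- sum(table(g) * (group_means - grand)^2)  # n_i weighted
ss_within  <- sum((y - group_means[g])^2)              # deviations from own group mean
all.equal(ss_total, ss_between + ss_within)  # TRUE
```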
The one-way ANOVA is an extension of the independent two-sample t-test for comparing means when there are two or more groups. In one-way ANOVA, the data are organized into several groups based on a single grouping variable (also called a factor).
- Independence of observations
- Normality (normal distributions of the residuals)
- Homoscedasticity (equal variances across groups)
df<-ToothGrowth
df$dose <- factor(df$dose,
levels = c(0.5, 1, 2),
labels = c("D0.5", "D1", "D2"))
kruskal.test(len ~ supp, data = df)
##
## Kruskal-Wallis rank sum test
##
## data: len by supp
## Kruskal-Wallis chi-squared = 3, df = 1, p-value = 0.06
The Kruskal-Wallis test is a nonparametric alternative to one-way ANOVA. From the output above we can see that the p-value (0.06) is not less than the significance level of 0.05, so we cannot conclude that tooth length differs significantly between supplement groups.
leveneTest(len ~ supp, data = df)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 1.21 0.28
## 58
From the output above we can see that the p-value is not less than the significance level of 0.05. This means that there is no evidence to suggest that the variance across groups is statistically significantly different. Therefore, we can assume the homogeneity of variances in the different treatment groups.
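As an aside, Bartlett's test is another option for checking homogeneity of variance (note that it is more sensitive to departures from normality than Levene's test):

```r
# Bartlett's test of homogeneity of variances, as an alternative
# to car::leveneTest; here it likewise fails to reject equal variances.
bartlett.test(len ~ supp, data = ToothGrowth)
```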
table(df$supp) # there are two levels
##
## OJ VC
## 30 30
group_by(df, supp) %>%
summarise(
count = n(),
mean = mean(len, na.rm = TRUE),
sd = sd(len, na.rm = TRUE)
)
## # A tibble: 2 x 4
## supp count mean sd
## <fct> <int> <dbl> <dbl>
## 1 OJ 30 20.7 6.61
## 2 VC 30 17.0 8.27
# Compute the analysis of variance
res.aov <- aov(len ~ supp, data = df)
# Summary of the analysis
summary(res.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## supp 1 205 205 3.67 0.06 .
## Residuals 58 3247 56
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As the p-value is more than the significance level 0.05, we can conclude that there are no significant differences between the groups.
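As a sanity check on the connection between tests: with only two groups, one-way ANOVA is equivalent to the pooled-variance two-sample t-test (F = t², identical p-values). A minimal sketch:

```r
# With two groups, one-way ANOVA is the pooled t-test in disguise:
# the F statistic equals the squared t statistic.
res <- aov(len ~ supp, data = ToothGrowth)
tt  <- t.test(len ~ supp, data = ToothGrowth, var.equal = TRUE)
f_val <- summary(res)[[1]][["F value"]][1]
all.equal(f_val, unname(tt$statistic)^2)  # TRUE
```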
Plot tooth length by supplement group, with the raw (jittered) points overlaid:
ggplot(df, aes(x=supp, y=len)) +
geom_boxplot(outlier.colour="red", outlier.shape=8,outlier.size=4) +
geom_jitter(shape=16, position=position_jitter(0.2))
plot(res.aov, 1)
In the plot above, there are no evident relationships between residuals and fitted values (the mean of each group), which is good. Points 23 and 26 are flagged as outliers, which can severely affect normality and homogeneity of variance; it can be useful to remove outliers to meet the test assumptions.
plot(res.aov, 2)
The QQ-plot above is used to check the assumption that the residuals are normally distributed; the points should approximately follow a straight line.
In a one-way ANOVA, a significant p-value indicates that some of the group means differ, but not which pairs. We can perform multiple pairwise comparisons to determine whether the mean differences between specific pairs of groups are statistically significant, e.g., with Tukey HSD (Tukey Honest Significant Differences; R function: TukeyHSD()).
TukeyHSD(res.aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = len ~ supp, data = df)
##
## $supp
## diff lwr upr p adj
## VC-OJ -3.7 -7.57 0.167 0.06
- diff: the difference between the means of the two groups
- lwr, upr: the lower and upper endpoints of the confidence interval (95% by default)
- p adj: the p-value after adjustment for multiple comparisons
The two-way ANOVA test is used to evaluate simultaneously the effects of two grouping variables (A and B) on a response variable. The number of levels may vary between factors.
table(df$supp, df$dose)
##
## D0.5 D1 D2
## OJ 10 10 10
## VC 10 10 10
group_by(df, supp, dose) %>%
summarise(
count = n(),
mean = mean(len, na.rm = TRUE),
sd = sd(len, na.rm = TRUE)
)
## # A tibble: 6 x 5
## # Groups: supp [2]
## supp dose count mean sd
## <fct> <fct> <int> <dbl> <dbl>
## 1 OJ D0.5 10 13.2 4.46
## 2 OJ D1 10 22.7 3.91
## 3 OJ D2 10 26.1 2.66
## 4 VC D0.5 10 7.98 2.75
## 5 VC D1 10 16.8 2.52
## 6 VC D2 10 26.1 4.80
In two-way ANOVA, we assume the sample sizes within cells are equal (a balanced design). If not, the ANOVA should be handled differently (e.g., with Type-III sums of squares).
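A hedged sketch of one common recipe for unbalanced designs: refit with sum-to-zero contrasts and request Type-III tests from car::Anova (the ToothGrowth data are balanced, so this is purely illustrative here):

```r
# Type-III sums of squares for a (potentially unbalanced) two-way design:
# use sum-to-zero contrasts on the factors, then car::Anova(type = "III").
library(car)
df2 <- ToothGrowth
df2$dose <- factor(df2$dose)
m <- lm(len ~ supp * dose, data = df2,
        contrasts = list(supp = contr.sum, dose = contr.sum))
Anova(m, type = "III")
```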
res.aov2 <- aov(len ~ supp + dose, data = df)
summary(res.aov2)
## Df Sum Sq Mean Sq F value Pr(>F)
## supp 1 205 205 14.0 0.00043 ***
## dose 2 2426 1213 82.8 < 2e-16 ***
## Residuals 56 820 15
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA table we can conclude that both supp and dose are statistically significant; dose is the more significant factor. These results suggest that changing the delivery method (supp) or the dose of vitamin C will significantly impact mean tooth length.
res.aov3 <- aov(len ~ supp * dose, data = df)
summary(res.aov3)
## Df Sum Sq Mean Sq F value Pr(>F)
## supp 1 205 205 15.57 0.00023 ***
## dose 2 2426 1213 92.00 < 2e-16 ***
## supp:dose 2 108 54 4.11 0.02186 *
## Residuals 54 712 13
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It can be seen that the two main effects (supp and dose) are statistically significant, as well as their interaction.
Plot mean tooth length (“len”) by dose, with error bars, coloring by the second group (“supp”):
data_summary <- function(data, varname, groupnames){
require(plyr)
summary_func <- function(x, col){
c(mean = mean(x[[col]], na.rm=TRUE),
sd = sd(x[[col]], na.rm=TRUE))
}
data_sum<-ddply(data, groupnames, .fun=summary_func,
varname)
data_sum <- rename(data_sum, c("mean" = varname))
return(data_sum)
}
df3 <- data_summary(df, varname="len",
groupnames=c("supp", "dose"))
## Loading required package: plyr
## ----------------------------------------------------------------------------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ----------------------------------------------------------------------------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:Hmisc':
##
## is.discrete, summarize
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise, summarize
## The following object is masked from 'package:purrr':
##
## compact
ggplot(df3, aes(x=dose, y=len, group=supp, color=supp)) +
geom_errorbar(aes(ymin=len-sd, ymax=len+sd), width=.1,
position=position_dodge(0.05)) +
geom_line() + geom_point()+
scale_color_brewer(palette="Paired")+theme_minimal()
As before, Tukey’s HSD determines whether the mean differences between specific pairs of groups are statistically significant; here we request only the dose comparisons.
TukeyHSD(res.aov3, which = "dose")
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = len ~ supp * dose, data = df)
##
## $dose
## diff lwr upr p adj
## D1-D0.5 9.13 6.36 11.90 0
## D2-D0.5 15.50 12.73 18.26 0
## D2-D1 6.37 3.60 9.13 0
- t-test: comparing means between two groups in a sample (t.test function)
- Linear regression: comparing one or more predictors in a sample (lm or lmer function)
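To make the t-test/regression connection concrete, the sketch below shows that a pooled two-sample t-test and lm with a binary factor are the same model: the factor coefficient is the group mean difference, and the t statistics agree up to sign.

```r
# A pooled two-sample t-test and lm with a two-level factor are equivalent.
tt  <- t.test(len ~ supp, data = ToothGrowth, var.equal = TRUE)
fit <- lm(len ~ supp, data = ToothGrowth)
coef(fit)["suppVC"]  # difference in group means: mean(VC) - mean(OJ)
all.equal(unname(tt$statistic),
          -summary(fit)$coefficients["suppVC", "t value"])  # TRUE (sign flips)
```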
For this hands on practice, load the spi data set from the psych package.
Check the help file, structure, and first few observations of the data
Change the integer categorical variables to factors by cross-referencing the help page and filling in the pipeline below (note when knitting this document, you may want to change eval to TRUE once you complete this exercise)
spi <- spi %>%
mutate(.,
sex_fac = case_when(),
health_fac = case_when(),
education_fac = case_when(),
smoke_fac = case_when(),
exer_fac = case_when()
)
Use either describe() or skim() to describe the data by each categorical variable
Describe the data by both categorical variables at once
Perform a chi-squared test assessing whether there is a difference in smoking by education level
Get the covariance and correlation matrix for the relationship between three numeric values of your choosing
Perform a test of significance on two of the variables selected above